Fix HTML annotation cleanup and Trafilatura evaluation #70

Merged: e06084 merged 1 commit into opendatalab:main from e06084:fix-issue-68-trafilatura-metrics on Jun 13, 2026

e06084 (Collaborator) commented on Jun 13, 2026

Summary

  • strip browser annotation artifacts from the HTML before extractors run, while preserving the text they wrap
  • align Trafilatura defaults with upstream and make the txt variant call trafilatura.extract()
  • pass metric_config into the metrics so that LLM-enhanced formula splitting is actually used
  • add a 545-sample leaderboard reproduction script and refresh the README results
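
To illustrate the first point, here is a minimal sketch of unwrapping annotation elements while keeping the text inside them. It is not the code from webmainbench/utils/html_cleaner.py: the function name `unwrap_annotations` and the tag names in `ANNOTATION_TAGS` are assumptions made for the example, since this description does not list which elements are stripped.

```python
import re

# Hypothetical tag names: the PR description does not say which
# annotation elements the real cleaner targets.
ANNOTATION_TAGS = ("mark", "web-highlight")

# Matches an opening or closing tag for any listed name. The lookahead
# stops "mark" from also matching longer names such as "marker".
_ANNOTATION_TAG_RE = re.compile(
    r"</?(?:%s)(?=[\s/>])[^>]*>" % "|".join(map(re.escape, ANNOTATION_TAGS)),
    re.IGNORECASE,
)


def unwrap_annotations(html: str) -> str:
    """Remove annotation open/close tags but keep the text between them."""
    return _ANNOTATION_TAG_RE.sub("", html)


if __name__ == "__main__":
    print(unwrap_annotations('<p>Hello <mark data-id="1">world</mark>!</p>'))
```

A regex keeps the sketch short, but it would mishandle attribute values that contain `>`; a parser-based cleaner is likely the safer choice for real pages.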

Validation

  • python -m py_compile examples/run_545_leaderboard.py webmainbench/evaluator/evaluator.py webmainbench/extractors/base.py webmainbench/extractors/trafilatura_extractor.py webmainbench/extractors/trafilatura_txt_extractor.py webmainbench/metrics/base.py webmainbench/metrics/base_content_splitter.py webmainbench/metrics/calculator.py webmainbench/utils/html_cleaner.py tests/test_html_cleaner.py tests/test_metric_config.py tests/test_trafilatura_config.py
  • reran 545-sample baseline rows with LLM metric splitting enabled
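
The syntax check above could also be scripted, for example to report every failing file in one pass. The following is a rough sketch using the standard-library `py_compile` module; the helper name `compile_all` is made up for illustration and is not part of the repository.

```python
import py_compile
import sys


def compile_all(paths):
    """Byte-compile each file; return (path, error) pairs for failures."""
    failures = []
    for path in paths:
        try:
            # doraise=True turns syntax errors into PyCompileError
            # instead of printing them to stderr.
            py_compile.compile(str(path), doraise=True)
        except (py_compile.PyCompileError, OSError) as exc:
            failures.append((str(path), str(exc)))
    return failures


if __name__ == "__main__":
    problems = compile_all(sys.argv[1:])
    for path, error in problems:
        print(f"{path}: {error}", file=sys.stderr)
    sys.exit(1 if problems else 0)
```

Like `python -m py_compile`, this only confirms that the files parse; it does not import or run them.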

e06084 merged commit 9d991bd into opendatalab:main on Jun 13, 2026
1 check passed